feat(cache): KEP-2655: Adding cache initializer #2793
Conversation
Pull Request Test Coverage Report for Build 18470076977 (Details)
💛 - Coveralls
Force-pushed f542e4a to 87003f6
Thanks @akshaychitneni!
I left my initial comments.
/milestone v2.1
Force-pushed 2392316 to e20b1c1
Force-pushed 5c77408 to e3b9fad
Thanks @akshaychitneni, I left a few comments.
cc @kubeflow/kubeflow-trainer-team @rudeigerc appreciate your review as well!
```python
@dataclass
class CacheDatasetInitializer:
    storage_uri: str
    train_job_name: str
```
This also is not needed; the TrainJob name is equal to the JobSet name.
We can set the `TRAIN_JOB_NAME` env variable in the ClusterTrainingRuntime, which gets its value from
`metadata.labels['jobset.sigs.k8s.io/jobset-name']`.
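For context, the initializer side of this suggestion could look like the sketch below, assuming the runtime injects `TRAIN_JOB_NAME` via the downward API; the helper name is hypothetical, not part of the PR.

```python
import os


def get_train_job_name() -> str:
    # TRAIN_JOB_NAME is assumed to be injected by the runtime via the
    # downward API from metadata.labels['jobset.sigs.k8s.io/jobset-name'].
    name = os.environ.get("TRAIN_JOB_NAME", "")
    if not name:
        raise RuntimeError("TRAIN_JOB_NAME env variable is not set")
    return name
```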
Yes. I plan to set it as an env var on the initializer container using the downward API, and here I am accessing that env var to use the TrainJob name for the ownerRef.
@akshaychitneni If you get TrainJob name from the env, you don't need to set it in the CacheDatasetInitializer.
Similar to namespace:
https://github.com/kubeflow/trainer/pull/2793/files#diff-ed9e751df204997160579feb800b458887ed801b5caab572c0b5142b2e63129bR52
But isn't it similar to fetching from the env variable? For namespace, I am not using an env var.
CacheDatasetInitializer is a config that we expose to the end user. Since the TrainJob name will always be set via an env variable, you don't need to expose this parameter in the config that the user can adjust.
Does it make sense @akshaychitneni?
This is not directly exposed to the users, though; we create it from env vars:
trainer/pkg/initializers/utils/utils.py, line 42 in b918411:

```python
def get_config_from_env(config) -> Dict[str, str]:
```
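To make the discussion concrete, a helper like the referenced one might map dataclass fields to upper-cased env vars as sketched below; the upper-casing convention and skipping of unset variables are assumptions, not necessarily the exact utils.py behavior.

```python
import os
from dataclasses import fields
from typing import Dict


def get_config_from_env(config) -> Dict[str, str]:
    # Sketch: for each dataclass field (e.g. storage_uri), read the
    # matching upper-cased env var (STORAGE_URI); unset vars are skipped.
    result: Dict[str, str] = {}
    for field in fields(config):
        value = os.environ.get(field.name.upper())
        if value is not None:
            result[field.name] = value
    return result
```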
I see, so you are going to have the TRAIN_JOB_NAME env being set in your ClusterTrainingRuntime, correct?
Yes
pkg/initializers/dataset/cache.py (outdated):

```python
logging.info(f"Created LeaderWorkerSet {train_job_name}-cache")

# Create Service
service = client.V1Service(
```
LWS doesn't have any API to create a Service automatically?
Maybe @kerthcet @kannon92 @ahg-g @ardaguclu knows about it?
I think the LWS controller doesn't create/manage Service objects, and it is up to clients to define them.
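Since the client has to define the Service itself, a headless Service manifest for the cache cluster might look like the sketch below; the selector label key and the port number are illustrative assumptions, not taken from the PR.

```python
def build_cache_service(train_job_name: str, namespace: str) -> dict:
    # Sketch of a headless Service for the cache LeaderWorkerSet, so each
    # cache pod gets a stable DNS record. Label key and port are assumed.
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": f"{train_job_name}-cache",
            "namespace": namespace,
        },
        "spec": {
            "clusterIP": "None",  # headless: per-pod DNS records
            "selector": {
                "leaderworkerset.sigs.k8s.io/name": f"{train_job_name}-cache"
            },
            "ports": [{"name": "cache", "port": 50051}],
        },
    }
```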
Force-pushed dface27 to efbebae
Thanks @akshaychitneni, just left a few comments.
/cc @kubeflow/kubeflow-trainer-team @rudeigerc in case you want to leave more comments.
```python
    ],
)
def test_default_values(test_name, config_values, expected_defaults):
```
Do you need this test case, since you verify cluster creation here:

```python
def test_create_cache_cluster(test_name, test_case):
```
Here I am validating the config, and in test_create_cache_cluster I am looking for the k8s API calls.
@akshaychitneni Can you just create another test case in the test_create_cache_cluster function?
```diff
@@ -0,0 +1,356 @@
from unittest.mock import MagicMock, patch
```
@akshaychitneni Are you going to add integration tests in the future PRs?
```python
"HuggingFace - Invalid dataset",
```
Yes. I will work on adding integration tests in the future PRs
```python
config_dict = utils.get_config_from_env(types.CacheDatasetInitializer)
self.config = types.CacheDatasetInitializer(**config_dict)

def download_dataset(self):
```
@akshaychitneni Did you review this? Even though the DatasetProvider interface doesn't have this API, we can still directly call the cache.create_cache_cluster() API here:

```python
cache.download_dataset()
```
Force-pushed efbebae to e3a6544
```python
    "worker_cpu": "8",
    "worker_mem": "16Gi",
},
"expected_substitutions": {
```
Where do you verify expected_substitutions ?
Force-pushed 5fc611b to 27c69ec
Thanks @akshaychitneni!
/lgtm
/assign @astefanutti @Electronic-Waste @tenzen-y @rudeigerc
```python
# Get TrainJob for owner reference
try:
    training_job = custom_api.get_namespaced_custom_object(
```
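For context, a fetched TrainJob object is typically turned into an ownerReference roughly as sketched below, so the cache resources are garbage-collected with the TrainJob; the helper is illustrative, not the PR's exact code.

```python
def owner_reference(train_job: dict) -> dict:
    # Build an ownerReference from a TrainJob fetched via the custom
    # objects API; setting controller/blockOwnerDeletion ties the cache
    # resources' lifecycle to the TrainJob.
    return {
        "apiVersion": train_job["apiVersion"],
        "kind": train_job["kind"],
        "name": train_job["metadata"]["name"],
        "uid": train_job["metadata"]["uid"],
        "controller": True,
        "blockOwnerDeletion": True,
    }
```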
What gives the permissions to the TrainJob initializer to perform those requests to the API server?
The runtime should be configured with the initializer using a ServiceAccount that has the relevant permissions. We plan to document it.
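An illustrative RBAC Role for that ServiceAccount, expressed here as a Python dict for readability, might grant the verbs the initializer's API calls need; the API groups and resource names are assumptions based on the discussion, not a documented manifest from the PR.

```python
def cache_initializer_role(namespace: str) -> dict:
    # Sketch of a namespaced Role for the cache initializer's
    # ServiceAccount; bind it with a matching RoleBinding.
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "cache-initializer", "namespace": namespace},
        "rules": [
            {  # read the TrainJob to build the ownerRef
                "apiGroups": ["trainer.kubeflow.org"],
                "resources": ["trainjobs"],
                "verbs": ["get"],
            },
            {  # create the cache LeaderWorkerSet
                "apiGroups": ["leaderworkerset.x-k8s.io"],
                "resources": ["leaderworkersets"],
                "verbs": ["create", "get"],
            },
            {  # create the companion Service
                "apiGroups": [""],
                "resources": ["services"],
                "verbs": ["create"],
            },
        ],
    }
```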
```python
schema_name = self.schema_name

# Load Kubernetes configuration
config.load_incluster_config()
```
I'm not too deep into the design of this, so apologies for the out-of-context comment, but my first reaction is: should all this be part of the control plane and not the runtime?
You are right; ideally we should move it to the operator, we just didn't get a chance to work on this.
@akshaychitneni Maybe as a workaround before building a cache controller, we can use trainer-controller-manager to create the LWS with the appropriate spec (e.g. the cache plugin can be activated when storageURI is set as follows: cache://database/table).
Yes, that was the initial plan: to add it as a plugin to the trainer. Since we intend to eventually leverage its own operator, we haven't pursued that path. I think we can revisit this approach.
```python
annotations={
    "eks.amazonaws.com/sts-regional-endpoints": "true",
    "eks.amazonaws.com/role-arn": iam_role,
},
```
Should that be made configurable?
Makes sense.
Our initial implementation only supports S3 via IAM. I think it is good to make this configurable once we support additional providers.
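One way to make those annotations configurable, sketched below, is to read them from an env var instead of hard-coding the EKS/IAM pair; `CACHE_SA_ANNOTATIONS` is a hypothetical name, not an existing setting in the PR.

```python
import json
import os
from typing import Dict


def service_account_annotations() -> Dict[str, str]:
    # Hypothetical: CACHE_SA_ANNOTATIONS holds a JSON map of
    # provider-specific ServiceAccount annotations, e.g.
    # '{"eks.amazonaws.com/role-arn": "arn:aws:iam::123456789012:role/cache"}'.
    raw = os.environ.get("CACHE_SA_ANNOTATIONS", "")
    return json.loads(raw) if raw else {}
```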
Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Force-pushed 27c69ec to 3516259
I think we should be good to move this forward.
@akshaychitneni Please create tracking issues for the @astefanutti's suggestions, so we can track them: #2793 (comment)
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
What this PR does / why we need it:
Adds a dataset initializer to bootstrap the cache.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged): Fixes #2792
Checklist: